Validity of age estimation methods and reproducibility of bone/dental maturity indices for chronological age estimation: a systematic review and meta-analysis of validation studies

Several approaches have been developed to estimate age, an important aspect of forensics and orthodontics, using different measures and radiological examinations. Here, through meta-analysis, we determined the validity of age estimation methods and reproducibility of bone/dental maturity indices used for age estimation. The PubMed and Google Scholar databases were searched to December 31, 2021 for human cross-sectional studies meeting pre-defined PICOS criteria that simultaneously assessed reproducibility and validity. Meta-estimates of validity (mean error: estimated age-chronological age) and intra- and inter-observer reproducibility (Cohen’s kappa, intraclass correlation coefficient) and their prediction intervals (PI) were calculated using mixed-effect models when heterogeneity was high (I² > 50%). The literature search identified 433 studies, and 23 met the inclusion criteria. The mean error meta-estimate (mixed effects model) was 0.08 years (95% CI − 0.12; 0.29) in males and 0.09 (95% CI − 0.12; 0.30) in females. The PI of each method spanned zero; of nine reported estimation methods, Cameriere’s had the smallest (− 0.82; 0.47) and Haavikko’s the largest (− 7.24; 4.57) PI. The reproducibility meta-estimate (fixed effects model) was 0.98 (95% CI 0.97; 1.00) for intra- and 0.99 (95% CI 0.98; 1.00) for inter-observer agreement. All methods were valid but with different levels of precision. The intra- and inter-observer reproducibility was high and homogeneous across studies.

Sixteen studies provided complete data for both mean errors and examiner agreements, while eight studies reported mean errors in age estimation without complete or usable data on intra- or inter-observer agreement. The mean error of the estimation methods was highly variable, ranging from − 0.02 years (the smallest in magnitude) using the Cameriere method applied to males 43 to − 2.96 years (the largest in magnitude) using the Haavikko method applied to females 50. The inter-examiner agreement ranged between 0.73 and 1 for Cohen's k/Fleiss' k and between 0.84 and 1 for ICC; similarly, the intra-examiner agreement ranged between 0.82 and 0.99 for Cohen's k and between 0.80 and 1 for ICC.
Study quality assessment (qualitative synthesis). The risk of bias assessment for the selected studies is presented in Table 2 and illustrated in Fig. 2. All studies accurately described the patient selection procedure except for El Bakary et al. 49 and Javadinejad et al. 53, in which the procedure was not clearly explained, and Franco et al. 44, in which the criteria were not reported, so these studies were classified as "unclear". With respect to the index test, we considered any study that clearly described the method of analysis of the radiographs, or the experience or number of observers making the measurements, as "low" risk. Three studies 49,53,59 did not provide enough information, while another study 63 was not completely specific. Four studies 44,57,59,63 did not report how the chronological age was assessed (the reference standard in Fig. 2), and this was interpreted as a risk of bias since a person could be confused or lie about their age. All studies provided good information on flow and timing. Despite the possibility of bias, no study had applicability concerns. All articles met the minimum criterion of regularity in the procedures, as defined by the PICOS/PECOS strategy 66, and therefore were included in the analysis.
Meta-analysis of age estimation validity. Since we found only two studies based on bone maturation indices, we did not produce a meta-estimate of the mean error. Concerning the age estimation validity based on dental maturation indices, significant heterogeneity was found for both males and females (males: I 2  due to the large sample size and the precision of the included studies. As a result, a mixed-effects model was applied to calculate the pooled mean error of age estimation by sex. The pooled male mean error of the age prediction was 0.08 years (95% CI − 0.12; 0.29), and the pooled female mean error was 0.09 years (95% CI − 0.12; 0.30). Figure 3 shows the stratification by age estimation methods, which are also summarized in Supplementary Methods 1.
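As an illustration of the pooling step described above, the following Python sketch computes a random-effects pooled mean error together with Cochran's Q and I². It assumes the DerSimonian-Laird estimator of the between-study variance, which the text does not specify; the function name `random_effects_pool` and all input values are hypothetical and are not data from the included studies.

```python
import math


def random_effects_pool(effects, variances):
    """Pool per-study mean errors with a DerSimonian-Laird random-effects model.

    effects   : per-study mean errors (estimated age - chronological age), in years
    variances : sampling variances of those mean errors
    """
    k = len(effects)
    if k < 2:
        raise ValueError("at least two studies are needed")

    # Inverse-variance (fixed-effect) weights and pooled estimate
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q and the I^2 index (percentage of variability due to heterogeneity)
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # DerSimonian-Laird estimate of the between-study variance tau^2
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights, pooled mean error and its 95% confidence interval
    w_re = [1.0 / (v + tau2) for v in variances]
    mu = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return {
        "mean": mu,
        "se": se,
        "ci": (mu - 1.96 * se, mu + 1.96 * se),
        "tau2": tau2,
        "q": q,
        "i2": i2,
    }


# Hypothetical per-study mean errors (years) and their variances
errors = [0.10, -0.45, 0.60, 0.05]
variances = [0.002, 0.004, 0.003, 0.005]
result = random_effects_pool(errors, variances)
print(result)
```

With large, precise studies the sampling variances are small, so even modest differences between study means produce a large Q and an I² close to 100%, which is consistent with the explanation given for the observed heterogeneity.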
Studies that implemented Nolla's method had a mean error closest to zero with a slight overestimation: mean male age prediction error of 0.02 (95% CI − 0.37; 0.41) and mean female age prediction error of 0.03 (95% CI − 0.34; 0.41). Haavikko's method was less accurate, with a mean error of − 1.12 (95% CI − 2.29; 0.06) and − 1.33 (95% CI − 2.54; − 0.13) for males and females, respectively. Cameriere's method also underestimated  For both males and females, the PI overlapped zero for all methods, so the difference between estimated and chronological ages was not statistically significant. For both genders, Cameriere's method showed the smallest PI, while the Haavikko and other methods had the widest intervals.

Meta-analysis of intra- and inter-examiner agreement. It was not possible to obtain a pooled Cohen's k (or Fleiss' k) due to a lack of information on the standard error or variance in the examined studies. Therefore, we compared only studies with ICCs and the studies reporting only the global reliability without stratification by gender. The meta-analytic pooled estimates of inter-examiner and intra-examiner agreement are summarized in Fig. 4.

Table 1. The studies included in the meta-analysis. All studies reported the type of examination as "orthopantomography" except two 52,54. *Wrist and hand X-ray; §The ICC variance was estimated using the formula reported in Noble et al. 65.

No heterogeneity was observed in inter-examiner (heterogeneity: Q = 5.78, p = 0.888) and intra-examiner (heterogeneity: Q = 9.11, p = 0.611) agreement, so a fixed-effects model was used. For inter-examiner agreement, the ICCs ranged from 0.89 to 0.99, and the meta-analytic pooled ICC was 0.98 (95% CI 0.97; 1.00), which was close to perfect reliability. Concerning intra-examiner agreement, the ICCs ranged from 0.90 to 1.00, and the meta-analytic pooled ICC was 0.99 (95% CI 0.98; 1.00), which was also close to perfect reliability.
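The fixed-effects pooling and heterogeneity check described above can be sketched in Python as follows. This is a minimal illustration that assumes inverse-variance weighting applied directly to the ICC values and a 95% confidence interval capped at 1; the text does not state whether any transformation of the ICCs was used, and the function name `fixed_effect_pool` and the input values are hypothetical.

```python
import math


def fixed_effect_pool(estimates, variances):
    """Inverse-variance fixed-effect pooling with Cochran's Q.

    Returns the pooled estimate, its standard error, Q and its degrees of freedom.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    se = math.sqrt(1.0 / total)
    # Q sums the weighted squared deviations of each study from the pooled value;
    # under homogeneity it follows a chi-square distribution with k - 1 df.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    return pooled, se, q, len(estimates) - 1


# Hypothetical ICCs and their variances (not taken from the included studies)
iccs = [0.97, 0.99, 0.98, 0.96]
icc_variances = [0.0004, 0.0002, 0.0003, 0.0005]

pooled, se, q, df = fixed_effect_pool(iccs, icc_variances)
# An ICC cannot exceed 1, so the upper confidence limit is capped (assumption)
ci = (pooled - 1.96 * se, min(1.0, pooled + 1.96 * se))
print(f"pooled ICC = {pooled:.3f}, 95% CI = ({ci[0]:.3f}; {ci[1]:.3f}), Q = {q:.2f}, df = {df}")
```

A Q value that does not exceed its degrees of freedom, as in this toy example, gives no indication of heterogeneity beyond sampling error, which is the situation in which a fixed-effects model is usually considered adequate.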

Discussion
Age estimation represents one of the most important aspects of dental/skeletal analysis and forensic anthropology, playing a key role in human identification, both in living subjects and in human remains 1,2,29. This meta-analysis provides a comprehensive overview of the current literature on the validity of age estimation methods and reproducibility of maturity indices, in particular those based on dental maturation. Although bone age has been widely used, we found only two validation studies on methods based on bone maturity indices that met our inclusion criteria. This low frequency could be due to the evidence that bone maturity indices are more affected by environmental factors than dental ones 23, and therefore it may be appropriate to validate each index only in the population in which it was developed. The 21 studies on dental maturity indices identified were conducted in different countries with the aim of validating specific methods of age estimation in specific populations. Although the age estimation methods were applied to different populations, the meta-analysis results, stratified by gender and methods, showed similar accuracy. In fact, for both males and females, the prediction intervals obtained for each method spanned zero, indicating that, despite the different prediction intervals and different target populations, all methods can be considered accurate. Significant heterogeneity between studies was observed for both genders as a consequence of the large sample size of the studies and hence of the high level of precision of the error estimates. Using a meta-regression model, we investigated whether this heterogeneity might be further explained by differences in characteristics of the studies or study populations, such as type of method, publication year, ethnicity, mean age of the study sample, and impact factor of the journal; however, the I² index remained very high (99.2%) for both genders (data not shown).
To account for the heterogeneity between the studies, we estimated random-effects models and prediction intervals, which give the range in which the validity of further studies is expected to fall based on current evidence 67.
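For reference, a commonly used form of the prediction interval in random-effects meta-analysis is given below. The text does not state which formula was applied, so this is an assumption; the symbols are introduced here for illustration: $\hat{\mu}$ is the pooled mean error, $\widehat{\mathrm{SE}}(\hat{\mu})$ its standard error, $\hat{\tau}^{2}$ the estimated between-study variance, $k$ the number of studies, and $t_{k-2,\,1-\alpha/2}$ the corresponding quantile of the Student t distribution with $k-2$ degrees of freedom.

```latex
\hat{\mu} \;\pm\; t_{k-2,\,1-\alpha/2}\,
\sqrt{\hat{\tau}^{2} + \widehat{\mathrm{SE}}(\hat{\mu})^{2}}
```

Because the interval includes $\hat{\tau}^{2}$, it is wider than the confidence interval of the pooled estimate whenever heterogeneity is present, and it describes where the mean error of a new validation study would be expected to lie rather than the uncertainty of the average.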
The studies that validated Nolla's method had a mean age estimation error closest to zero for both males (0.02 years) and females (0.04 years), while Cameriere's method had the narrowest prediction interval (male PI [− 1.07; 0.63]; female PI [− 0.82; 0.47]). Of the selected studies, Demirjian's method and its revisited version by Willems were the most frequently used methods for age estimation due to their ease of use, high reproducibility, and accuracy. Both methods tended to overestimate chronological age in males and females, but Willems' method had a narrow prediction interval, between − 0.95 and 1. The Haavikko method had the highest variability, with a prediction interval ranging from − 6.88 to 4.65 for males and from − 7.24 to 4.57 for females. This might be due to the variability in dental maturation among subjects of different ethnic origin 1 , since Haavikko's method is calibrated on Finnish children, whose dental maturation seems to occur earlier 68 . Recently, Butti et al. 69 and Mohammed et al. 50 reached the same conclusion that Haavikko's method is unsuitable for both Italian and Indian children.
With respect to method reliability, our results showed pooled reproducibility estimates close to perfect reliability (about unity), indicating that the methods are highly repeatable by expert examiners. This high reproducibility might be due to positive publication bias, as studies reporting good reliability are more frequently available in the literature than studies reporting poor or no reliability 70.
The strengths of our research are the adequate number of studies included, the precision of pooled mean errors, and the comprehensive evaluation of all methods and indices based on dental maturity for which, respectively, the validity and reproducibility measures were available in the literature. To the best of our knowledge, this is the first meta-analysis that simultaneously evaluated the validity of dental age estimation methods and the reproducibility of dental maturity indices, thereby allowing more informed and safer choices in all medical and legal fields requiring these methods. Finally, the quality of the selected studies was very high: only 10% of studies had an unclear or high risk of bias, and none raised concerns about applicability.
However, our evaluation also has some limitations and shows only a partial picture of the validity and reproducibility of age estimation methods, owing to the strict exclusion criteria applied in order to provide unbiased meta-estimates. We excluded articles without information on both validity and reproducibility outcomes, articles not written in English or Italian, and those where it was impossible to obtain pooled reproducibility estimates of Cohen's kappa or the ICC due to a lack of information on the variability measure. In addition, some studies used inappropriate methods to estimate reproducibility, as discussed in Ferrante et al. 71. Lastly, after reading the full texts, we excluded several studies (31 out of 75) that had the word "validation" in the title or abstract but used an inadequate approach to validate the method or performed no validation at all.
In conclusion, since only two studies based on bone maturity indices reported the validation and reproducibility analysis, it was not possible to perform a meta-analysis for them. All studies reporting methods based on dental maturity indices that underwent a validation process were considered in this review; for each method, the difference between estimated and chronological age was not significantly different from zero years, indicating high validity. Nevertheless, the width of the prediction intervals varied considerably between methods (research focus 1; Supplementary material "Methods"). Furthermore, a high intra- and inter-observer reproducibility of dental maturity indices was observed (research focus 2; Supplementary material "Methods"). The Nolla and Cameriere methods might be recommended as preferred approaches, although the Cameriere method was validated on a smaller sample size than Nolla's and requires further testing on additional populations to better assess the mean error estimates by sex.
In the development of new methods of age estimation, it will be important to apply rigorous validation and publish a minimum dataset that ensures comparability of validity and reliability between different studies.